Written by Savahnna L. Cunningham
Date: October 17, 2017
The Red Wine dataset is publicly available for research. The details are
described in [Cortez et al., 2009].
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine
preferences by data mining from physicochemical properties. In Decision Support
Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at:
The goal of this analysis is to quantify and gain an understanding of how
chemical properties impact the quality rating of red wine. The dataset contains
1599 red wine samples with 11 variables, quantifying the physicochemical
properties of wine. The wine samples in this dataset are related to red variants
of the Portuguese “Vinho Verde” wine.
A multiple regression analysis will be conducted on the dataset to test how
changes in the 11 independent physicochemical properties predict a level of
change in the quality rating of a wine. The F-test will be used to determine
which predictor variables merit inclusion in the model.
The statistical hypotheses for this analysis are as follows:
H0 (Null Hypothesis): None of the 11 independent physicochemical properties has
a linear relationship with the dependent quality rating of a wine, so every
regression coefficient (β1, ..., β11) is zero, which can be mathematically
represented as
H0: β1 = β2 = ... = β11 = 0
HA (Alternate Hypothesis): At least one of the 11 independent physicochemical
properties predicts the outcome of the dependent quality rating of a wine,
which can be mathematically represented as
HA: βj ≠ 0 for at least one j in 1, ..., 11
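The overall F-test in a multiple regression compares an intercept-only model with the model that contains every predictor. The sketch below illustrates that comparison on simulated data. The data frame `toy` and its columns `x1`, `x2`, `x3` and `y` are made up for illustration and are not part of the wine dataset.

```r
# Simulated stand-in data (hypothetical, not the wine dataset)
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 5 + 0.8 * toy$x1 - 0.5 * toy$x2 + rnorm(n)

# Null model: intercept only (all slopes fixed at zero)
fit_null <- lm(y ~ 1, data = toy)
# Full model: every predictor included
fit_full <- lm(y ~ ., data = toy)

# anova() on two nested models performs the F-test of H0: all slopes are zero
f_table <- anova(fit_null, fit_full)
print(f_table)

# The same statistic appears on the last line of summary(fit_full)
f_anova <- f_table$F[2]
f_summary <- unname(summary(fit_full)$fstatistic["value"])
```

Both routes should give the same F value, which is why the last line of a `summary()` printout is enough to judge the overall hypothesis.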
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile
(do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at high
levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add 'freshness'
and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops,
it's rare to find wines with less than 1 gram/liter and wines with
greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between
molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents
microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low
concentrations, SO2 is mostly undetectable in wine, but at free SO2
concentrations over 50 ppm, SO2 becomes evident in the nose and taste
of wine.
8 - density: the density of wine is close to that of water depending on
the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0
(very acidic) to 14 (very basic); most wines are between 3-4 on the
pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide
gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Summary statistics for the 13 columns in the data frame. The X1 column holds the
wine ID. The ‘quality’ variable is the dependent variable and is qualitative
data based on a perceived like or dislike for the wine sample.
## X1 fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.43 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## NA's :2
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
##
Creating a Categorical Variable
The variable quality is stored as the numeric type ‘int’, which is awkward for
comparing groups of wines. The first step is to derive a factor from the
quality variable and add it to the data frame as a new variable,
quality.rating, with three categories of quality:
good (>= 7), bad (<= 4), and mediocre (5 and 6).
# Gained inspiration for this code from the R-Bloggers website[6&7].
# Bin the integer score into three labelled categories:
# bad (<= 4), mediocre (5 and 6), good (>= 7)
wine$quality.rating <- cut(wine$quality,
                           breaks = c(-Inf, 4, 6, Inf),
                           labels = c("bad", "mediocre", "good"))
The visualizations represent the distribution of the dependent variable analyzed
in the dataset. The upper plot is a histogram of the raw wine quality score.
Most of the wine samples have a score of 5 or 6. The raw quality data was
transformed into categorical data to make comparisons between groups easier.
Scores of 4 or less were labeled “bad”, scores of 5 or 6 were labeled
“mediocre” and scores of 7 or higher were labeled “good”. The bottom
visualization depicts the distribution of the resulting categories. The large
majority of wine samples fall into the “mediocre” category, “good” wines account
for ~250 samples, and “bad” wines are the least common.
The following independent variables have a normal or close-to-normal distribution:
fixed.acidity, volatile.acidity, density, pH and alcohol. Citric acid is the
exception, with a bimodal distribution.
The residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and
sulphates variables do not have a normal or close-to-normal distribution. Their
distributions are right-skewed, so the data will be transformed toward
normality using a logarithmic function [4,13].
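As a rough illustration of what a log transform does to a right-skewed variable, the sketch below measures skewness before and after applying `log10()`. The vector `x` is simulated from a log-normal distribution as a hypothetical stand-in for a measurement such as residual.sugar, and `skewness()` is a small helper written here for illustration, not a function from the original analysis.

```r
# Simulated right-skewed variable (hypothetical, not the wine data)
set.seed(1)
x <- rlnorm(1000, meanlog = 1, sdlog = 0.6)

# Sample skewness: the third standardized moment
skewness <- function(v) {
  v <- v[!is.na(v)]
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))

cat("skewness before log10:", round(skew_raw, 2), "\n")
cat("skewness after log10: ", round(skew_log, 2), "\n")
```

A skewness near zero after the transform suggests the logged variable is close to symmetric. The transform only applies to strictly positive values, so a variable containing zeros would need different handling.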
Synopsis
The red wine dataset contains 1599 red wine samples described by 11
physicochemical variables that affect a wine’s perceived quality. Five
physicochemical variables had non-normal, right-skewed distributions. A
logarithmic function was used to better understand these distributions. The main
features of interest are the 11 independent variables and how they correlate to
a wine’s quality.
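Density appears to have a small positive correlation with the acids.
Additionally, pH has an inverse relationship with the acids, which is to be
expected.
The following pairs of independent variables have a strong correlation (absolute
value >0.5): free sulfur dioxide and total sulfur dioxide, fixed acidity and
density, fixed acidity and pH, and fixed acidity and citric acid.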
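Strongly correlated pairs can be pulled out of a correlation matrix programmatically. The sketch below is one way to do that. `strong_pairs()` is a hypothetical helper, and the data frame `toy` with its columns `acid`, `dens` and `noise` is simulated for illustration only.

```r
# Return variable pairs whose absolute Pearson correlation exceeds a threshold
strong_pairs <- function(df, threshold = 0.5) {
  cm <- cor(df, use = "pairwise.complete.obs")
  # Keep the upper triangle only so each pair is reported once
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r = cm[idx],
             stringsAsFactors = FALSE)
}

# Simulated data (hypothetical): dens tracks acid closely, noise is unrelated
set.seed(7)
acid <- rnorm(300)
toy <- data.frame(acid = acid,
                  dens = acid + rnorm(300, sd = 0.5),
                  noise = rnorm(300))

pairs_found <- strong_pairs(toy)
print(pairs_found)
```

Using the absolute value means strong negative relationships, such as the one between an acid and pH, are reported alongside positive ones.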
The exploratory data analysis will focus on the relationship between the
independent variables and the dependent quality rating variable.
The visualization indicates good wine contains a higher percentage of alcohol,
averaging ~11.5% by volume.
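Group averages such as the one quoted above can be computed directly rather than read off a boxplot. The sketch below shows the idea with `tapply()`. The data frame `toy` contains invented values for illustration only, so its output says nothing about the real wine samples.

```r
# Toy data frame (values invented for illustration only)
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0, 12.5, 9.2),
  quality = c(4, 5, 5, 6, 6, 7, 8, 3)
)

# Same cut-offs as the quality.rating variable: bad <= 4, mediocre 5-6, good >= 7
toy$quality.rating <- cut(toy$quality,
                          breaks = c(-Inf, 4, 6, Inf),
                          labels = c("bad", "mediocre", "good"))

# Mean alcohol content within each quality category
mean_alcohol <- tapply(toy$alcohol, toy$quality.rating, mean)
print(round(mean_alcohol, 2))
```

Swapping `mean` for `median` gives the center line that a boxplot draws.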
The visualization indicates good wine contains a higher concentration of
Tartaric Acid, averaging ~9 g/dm³ for good wine.
The visualization indicates good wine contains a higher quantity of citric acid.
Quality ratings are noticeably higher when a wine has a citric acid content
between 0.30 and 0.50 g/dm³, with an average citric acid concentration of
~0.35 g/dm³.
The visualization indicates there is not a large difference between good and bad
wine with regard to Potassium sulphate concentration. The results show a good
wine will have a mean Potassium sulphate concentration of ~0.75 g/dm³, while a
bad wine will have a mean of ~0.55 g/dm³.
The visualization indicates that the greater the concentration of volatile acids
in a wine, the worse the quality rating. Wines rated good typically contain
<0.4 g/dm³ acetic acid.
The visualization indicates that all wine samples are relatively close to the
density of water, averaging around 0.9975 g/cm³. However, the wine samples with
a good quality rating have a slightly lower density, with a mean value of
~0.996 g/cm³.
The visualization shows mediocre and good wine have a pH value <= 3.25, in
comparison to a bad quality rating, which has a higher mean pH value >= 3.25.
The visualization represents the inverse relationship between pH and the weak
acids found in wine. Substances with a pH below 7.0 are termed acidic and
solutions with a pH above 7.0 are termed basic. As you can see, the red wine
samples as a whole are considered an acidic solution. As pH goes up, the less
acidic the wine becomes.
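Because pH is a base-10 logarithmic scale, small differences in pH correspond to large differences in acidity. The sketch below converts pH to an approximate hydrogen ion concentration using the standard definition pH = -log10[H+] and compares the lowest and highest pH values from the summary table (2.74 and 4.01). The helper name `h_concentration()` is an assumption made for illustration.

```r
# Approximate hydrogen ion concentration (mol/L) implied by a pH value
h_concentration <- function(pH) 10^(-pH)

# Lowest and highest pH reported in the summary table
ratio <- h_concentration(2.74) / h_concentration(4.01)

cat("Ratio of hydrogen ion concentrations:", round(ratio, 1), "\n")
```

The ratio equals 10 raised to the pH difference, so a gap of one pH unit means a tenfold change in hydrogen ion concentration.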
The visualization represents the effect acidity has on wine density. A likely
explanation is that dissolved acids add mass to the solution. Therefore, as the
acid concentration increases, the density of the wine also increases.
Synopsis
The bivariate analysis depicts notable relationships between wine quality and
the physicochemical characteristics. As the boxplots above show, there is a
positive correlation between fixed acidity, citric acid levels and wine
quality. The higher the non-volatile acid level, the better the wine quality.
Additionally, because acetic acid produces a vinegar taste, a negative
correlation can be found between the volatile acidity variable and wine quality.
A good wine has the lowest density, which makes sense because density has a
direct correlation with total acidity concentration. However, it is interesting
to point out that there seems to be a fine line between total acidity level and
pH value. For a wine to be considered good, it has to have a low volatile
acidity level in conjunction with higher citric acid and fixed acid
concentrations, while its pH stays at or below ~3.3.
The visualization compares the free sulfur dioxide and the total sulfur dioxide
variables to the dependent quality rating variable. A strong positive
correlation exists between the two sulfur dioxide variables. The majority of
good wine appears to have a free sulfur dioxide concentration of <50 mg/dm³ and
a total sulfur dioxide concentration of <100 mg/dm³.
The visualization compares the Fixed Acidity and pH variables with the dependent
quality rating variable. There is a strong inverse relationship between fixed
acidity and pH. This is to be expected: as pH levels rise, acidity decreases.
Also of note, the good quality rating has a wide, even spread.
The visualization compares the Fixed Acidity and Citric Acid variables with the
wine quality rating. Results show a strong positive linear relationship between
the independent variables and wine quality. Interestingly, the majority of the
good quality data points cluster above a 0.25 g/dm³ citric acid value.
The visualization compares the Fixed Acidity and the Density variables with the
wine quality rating. Results indicate a strong positive correlation between the
two independent variables; however, there does not appear to be any correlation
between them and the dependent variable. Interestingly, the majority of the
good quality data points are clustered at or below a Tartaric Acid value of
0.4 g/dm³.
The goal of the multiple linear regression model is to predict wine quality
based on the chemical properties of a wine sample.
# Multiple Linear Regression
dataset = read.csv('wineQualityReds.csv')
dataset = dataset[, 2:13]
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$quality, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Note: no manual feature scaling is needed; the lm() function takes care of it
# Fitting Multiple Linear Regression to the Training set
regressor = lm(formula = quality ~ .,
data = training_set)
summary(regressor)
##
## Call:
## lm(formula = quality ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.66781 -0.36656 -0.06195 0.45616 1.96562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.471e+01 2.370e+01 0.621 0.534945
## fixed.acidity 2.265e-02 2.878e-02 0.787 0.431501
## volatile.acidity -9.534e-01 1.347e-01 -7.078 2.41e-12 ***
## citric.acid -1.259e-01 1.619e-01 -0.778 0.436697
## residual.sugar 1.043e-02 1.627e-02 0.641 0.521547
## chlorides -1.932e+00 4.586e-01 -4.213 2.70e-05 ***
## free.sulfur.dioxide 3.379e-03 2.487e-03 1.359 0.174485
## total.sulfur.dioxide -3.005e-03 8.114e-04 -3.704 0.000222 ***
## density -1.067e+01 2.418e+01 -0.441 0.659225
## pH -4.486e-01 2.161e-01 -2.075 0.038143 *
## sulphates 8.889e-01 1.311e-01 6.778 1.86e-11 ***
## alcohol 2.917e-01 2.975e-02 9.804 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6519 on 1266 degrees of freedom
## Multiple R-squared: 0.3519, Adjusted R-squared: 0.3462
## F-statistic: 62.48 on 11 and 1266 DF, p-value: < 2.2e-16
# Predicting the Test set results
# y_pred is a matrix with the columns fit, lwr and upr
y_pred = predict.lm(regressor, newdata = test_set,
                    interval = "prediction", level = 0.95)
# smoothScatter() plots the first two columns of a matrix (fit against lwr)
smoothScatter(y_pred, pch = ".", cex = 5,
              col = "black",
              colramp = colorRampPalette(c("white", blues9)),
              xlab = "Fit",
              ylab = "Model Prediction",
              main = "Predicted Future Values")
The visualization represents the 95% prediction interval with data points
representing the model's predicted values. The predicted values fall inside
their own prediction interval by construction, so the plot by itself does not
measure accuracy; comparing the observed quality scores in the test set with the
interval bounds is the more informative check [10].
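One way to check how well 95% prediction intervals hold up is to count how many observed test values fall inside them and to report the error of the point predictions. The sketch below does this on simulated data. The data frame `sim`, the split into `train` and `test`, and the coefficients used to generate `y` are all assumptions made for illustration, not the wine data or the original model.

```r
# Simulated data (hypothetical stand-in for the training and test sets)
set.seed(123)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 5 + 0.3 * sim$x1 - 1.0 * sim$x2 + rnorm(n, sd = 0.65)

train <- sim[1:400, ]
test <- sim[401:500, ]

fit <- lm(y ~ ., data = train)
pred <- predict(fit, newdata = test, interval = "prediction", level = 0.95)

# Share of observed test values that fall inside their prediction interval
inside <- test$y >= pred[, "lwr"] & test$y <= pred[, "upr"]
coverage <- mean(inside)

# Root mean squared error of the point predictions
rmse <- sqrt(mean((test$y - pred[, "fit"])^2))

cat("coverage:", coverage, " RMSE:", round(rmse, 3), "\n")
```

A coverage close to 0.95 indicates the intervals are well calibrated, while the RMSE shows how far the point predictions typically are from the observed values.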
# Plot a correlation matrix
library(corrplot)
cor_matrix = cor(test_set[1:12])
par(mar = c(5, 4, 1.5, 2) + 0.1) # margin padding
corrplot(cor_matrix, method = "circle", tl.cex = 0.6)
title(main = "Regression Model Correlation Matrix", cex.main = 1.3)
Volatile acidity, chlorides, total sulfur dioxide, alcohol and sulphates are
highly significant predictors of the dependent variable (p < 0.001), while pH
has a weaker but still significant influence on wine quality (p < 0.05). The
model explains about 35% of the variance in quality (multiple R-squared of
0.3519). Now it is time to optimize it with the Backward Elimination method.
# Building the optimal model using Backward Elimination
regressor = lm(formula = quality ~ fixed.acidity +
volatile.acidity +
citric.acid +
residual.sugar +
chlorides +
free.sulfur.dioxide +
total.sulfur.dioxide +
density +
pH +
sulphates +
alcohol,
data = dataset)
summary(regressor)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68911 -0.36652 -0.04699 0.45202 2.02498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.197e+01 2.119e+01 1.036 0.3002
## fixed.acidity 2.499e-02 2.595e-02 0.963 0.3357
## volatile.acidity -1.084e+00 1.211e-01 -8.948 < 2e-16 ***
## citric.acid -1.826e-01 1.472e-01 -1.240 0.2150
## residual.sugar 1.633e-02 1.500e-02 1.089 0.2765
## chlorides -1.874e+00 4.193e-01 -4.470 8.37e-06 ***
## free.sulfur.dioxide 4.361e-03 2.171e-03 2.009 0.0447 *
## total.sulfur.dioxide -3.265e-03 7.287e-04 -4.480 8.00e-06 ***
## density -1.788e+01 2.163e+01 -0.827 0.4086
## pH -4.137e-01 1.916e-01 -2.159 0.0310 *
## sulphates 9.163e-01 1.143e-01 8.014 2.13e-15 ***
## alcohol 2.762e-01 2.648e-02 10.429 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
## F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide +
## pH + sulphates + alcohol, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60575 -0.35883 -0.04806 0.46079 1.95643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2957316 0.3995603 10.751 < 2e-16 ***
## volatile.acidity -1.0381945 0.1004270 -10.338 < 2e-16 ***
## chlorides -2.0022839 0.3980757 -5.030 5.46e-07 ***
## total.sulfur.dioxide -0.0023721 0.0005064 -4.684 3.05e-06 ***
## pH -0.4351830 0.1160368 -3.750 0.000183 ***
## sulphates 0.8886802 0.1100419 8.076 1.31e-15 ***
## alcohol 0.2906738 0.0168108 17.291 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6487 on 1592 degrees of freedom
## Multiple R-squared: 0.3572, Adjusted R-squared: 0.3548
## F-statistic: 147.4 on 6 and 1592 DF, p-value: < 2.2e-16
A multiple linear regression model was fit to the dataset using the backward
elimination method. The findings indicate six independent variables are
statistically significant predictors (p < 0.05) of the quality of a wine. A
violin plot was used to visualize the descriptive statistics of each influential
variable.
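Backward elimination by p-value can be written as a short loop: fit the model, drop the least significant predictor, and refit until every remaining predictor is below the chosen threshold. The sketch below is one possible implementation under stated assumptions, not the code used in this analysis. `backward_eliminate()` is a hypothetical helper that assumes numeric predictors, and the data frame `sim` is simulated for illustration.

```r
# Drop the predictor with the largest p-value until all are below alpha
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    # Coefficient p-values, excluding the intercept row
    p_values <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
    worst <- which.max(p_values[, 1])
    # Stop when everything is significant or only one predictor is left
    if (p_values[worst, 1] < alpha || length(predictors) == 1) break
    predictors <- setdiff(predictors, rownames(p_values)[worst])
  }
  fit
}

# Simulated data (hypothetical): two real effects and one pure-noise column
set.seed(2017)
n <- 300
sim <- data.frame(alcohol = rnorm(n), acidity = rnorm(n), noise = rnorm(n))
sim$quality <- 5.6 + 0.5 * sim$alcohol - 0.4 * sim$acidity +
  rnorm(n, sd = 0.6)

final_fit <- backward_eliminate(sim, response = "quality")
print(summary(final_fit)$coefficients)
```

Base R's `step()` function performs a related search but selects models by AIC rather than by individual p-values, so the two approaches can keep different variables.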
Take notice that in some cases, as with total sulfur dioxide, chlorides, and pH,
the difference between a “good” and a “bad” wine is minute. However, the three
independent variables with the greatest statistical influence, alcohol, volatile
acidity and sulphates, do show a noticeable difference in mean values between
quality ratings. The results indicate that a good wine will have a high alcohol
percentage, low volatile acid concentration and a Potassium sulphate
concentration of ~0.75 g/dm³.
The visualization compares the two independent variables with the highest
statistical influence on wine quality, alcohol and volatile acidity. Consistent
with the regression coefficients, quality rises with alcohol content and falls
as volatile acidity increases, with the strongest pattern seen in the 2, 4, 7
and 8 quality values.
Synopsis
The multivariate analysis reveals strong statistical correlations with six of
the independent physicochemical properties. The scatterplot visualizations
indicate that “good” wine will have low concentrations of Citric Acid, Tartaric
Acid, total sulfur dioxide and free sulfur dioxide.
An optimized multiple linear regression model using the Backward Elimination
method discovered alcohol, volatile acidity, sulphates, total sulfur dioxide,
chlorides and pH have a very strong statistical influence on wine quality.
The univariate analysis revealed six independent physicochemical properties have
normal or close-to-normal distributions, while the remaining five properties
have right-skewed distributions, requiring a logarithmic function be used to
better understand the distributions.
The bivariate analysis revealed a positive linear relation between the
independent physicochemical properties alcohol, fixed acidity, citric acid and
sulphates and the dependent variable. A negative linear relationship exists
between volatile acidity, density, pH and the quality rating. An inverse
relationship exists between pH and the acids: pH levels rise as acid levels
decrease. Additionally, there is a positive correlation between density and the
acids due to the chemical properties that exist between an acid molecule and the
surrounding substance. Strong correlations (>0.5) were discovered between free
sulfur dioxide and total sulfur dioxide, fixed acidity and density, fixed
acidity and pH, as well as fixed acidity and citric acid.
A multiple linear regression model containing 11 independent variables was fit
on a training set of 1278 wine samples and used to predict wine quality for a
test set of 321 wine samples. The model was significant overall (p-value
<2.2e-16), with a residual standard error of 0.6519 on 1266 degrees of freedom
and an F-statistic equal to 62.48 on 11 and 1266 DF. The 11 variables account
for 35.19% of the variance in wine quality (multiple R-squared of 0.3519).
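The multiple and adjusted R-squared values in the regression printouts are linked by a simple formula that penalizes the number of predictors. The sketch below applies it to the rounded R-squared values reported in the two model summaries. `adjusted_r2()` is a helper written for illustration, and because the inputs are rounded, the results should only be expected to agree with the printed values to about three decimal places.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors k
adjusted_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Full model on the training set: 1266 residual df + 12 coefficients = 1278 rows
adj_full <- adjusted_r2(r2 = 0.3519, n = 1278, k = 11)

# Reduced model on all 1599 rows with 6 predictors
adj_reduced <- adjusted_r2(r2 = 0.3572, n = 1599, k = 6)

print(round(c(full = adj_full, reduced = adj_reduced), 4))
```

Dropping five predictors barely lowers R-squared, which is why the adjusted value of the six-variable model is close to that of the full model fit on the same data.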
A second multiple linear regression model utilizing the Backward Elimination
method was fit on the full dataset of 1599 wine samples to determine which
predictor variables have the strongest statistical relationship with the
dependent variable.
Optimized Multiple Linear Regression Summary:
The Backward Elimination method indicates alcohol, volatile acidity, sulphates,
total sulfur dioxide, chlorides and pH are statistically significant predictors
of wine quality at the 5% significance level, with an overall p-value <2.2e-16
and a residual standard error of 0.6487 on 1592 degrees of freedom. These six
physicochemical properties account for 35.72% of the variance of wine quality
(multiple R-squared of 0.3572). The high F-statistic equal to 147.4 and small
p-value of < 2.2e-16 give sufficient statistical evidence that the six
independent variables predict the quality rating of wine; therefore, the Null
Hypothesis can be rejected.
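The F-statistics quoted in the model summaries can be reproduced from R-squared, the number of predictors and the residual degrees of freedom. The sketch below does the arithmetic with the rounded values from the two printouts. `f_from_r2()` is a helper written for illustration, and small differences from the printed values come from rounding R-squared to four decimals.

```r
# Overall F-statistic from R-squared, k predictors and residual df
f_from_r2 <- function(r2, k, df_resid) (r2 / k) / ((1 - r2) / df_resid)

# Full model: 11 predictors, 1266 residual degrees of freedom
f_full <- f_from_r2(0.3519, k = 11, df_resid = 1266)

# Reduced model: 6 predictors, 1592 residual degrees of freedom
f_reduced <- f_from_r2(0.3572, k = 6, df_resid = 1592)

# Upper-tail probability of the F distribution gives the p-value
p_reduced <- pf(f_reduced, df1 = 6, df2 = 1592, lower.tail = FALSE)

print(round(c(full = f_full, reduced = f_reduced), 1))
```

The reduced model has a larger F-statistic even though its R-squared is similar, because the same explained variance is spread over fewer predictors.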
Plot One
The quality variable was stored as the numeric type ‘int’, which is awkward for
comparing groups of wines, so it was transformed into a categorical variable.
The majority of the wine samples fall into the “mediocre” quality rating.
Plot Two
The alcohol and sulphate variables have a positive correlation with the quality
variable. Volatile acidity, chlorides and pH have a negative correlation with
the quality rating. Interestingly, total sulfur dioxide has a normal
distribution with the quality variable. Results show that a good wine will have
a high alcohol percentage, low volatile acid concentration and a Potassium
sulphate concentration of ~0.75 g/dm³.
Plot Three
The results from the multiple linear regression model show that alcohol and
volatile acidity have the strongest statistical influence on wine quality. This
plot compares these two variables against wine quality. Wine quality has a
positive correlation with alcohol and a negative correlation with volatile
acidity.
The exploratory data analysis revealed the distributions of the 11 independent
variables, as well as the interactions the physicochemical properties have with
each other. The multivariate analysis focused on the independent variables with
strong correlations (>0.5). Fixed acidity, with three such relationships, is
strongly correlated with more of the other independent variables than any other
property.
The dependent wine quality variable has a normal distribution with most samples
having a 5-6 quality score. The alcohol content and volatile acidity
concentration have the strongest statistical influence on the dependent variable
with sulphates, total sulfur dioxide, chlorides and pH also having an influence
on the quality of a wine. Interestingly, when the two most influential
variables, alcohol content and volatile acidity, are compared with the wine
quality rating variable results show good wine has an alcohol content >11.5% by
volume and an Acetic Acid concentration <0.5 g/dm³. Future work on this
dataset should include exploring the outliers in this analysis. Why is a wine
with a high alcohol percentage and a high Acetic Acid concentration still
considered a good wine? Is there a unique combination of physicochemical
properties within these samples that leads to these abnormal quality ratings?
This analysis used a multiple linear regression model to account for 35.72% of
the variance of wine quality. To improve the predictive power of the model,
additional data with a wider spread of quality scores should be used. Moreover,
additional predictive models should be employed, such as Support Vector Machine
(SVM), Decision Tree Regression or K-Nearest Neighbors (KNN), to provide more
accurate predictions for a wine’s quality as a function of the independent
physicochemical properties.
The multiple linear regression model determined the independent physicochemical
properties with the highest statistical influence on wine quality are alcohol
percentage, volatile acidity, sulphates, total sulfur dioxide, chlorides and pH.
Sulphates are added to wine and act as an antimicrobial and antioxidant,
signifying good wines will have a Potassium sulphate concentration of
~0.6 g/dm³. Furthermore, it was discovered good wines have low chloride and
total sulfur dioxide concentrations and a low pH. The independent variables that
have the maximum statistical influence on wine quality are volatile acidity and
alcohol percentage. Therefore, good wines will consist of a high alcohol
percentage and a low concentration of volatile acids, which give wines an
unpleasant, vinegar taste. This analysis exposes a strong relationship between
wine quality and the physicochemical properties of alcohol and volatile
acidity, demonstrating the importance of a wine being free of imperfections.
I chose the Red Wine Quality dataset to get a better understanding of how the
physicochemical compounds found in wine affect a wine’s quality. Home brewing
wine is on my bucket list and one of the goals of this data analysis was to
learn what makes a quality wine so that I may implement my findings in the
future. I consider this a successful data analysis because I was able to explore
the independent variables and compare them with the dependent quality variable
and gain a clear understanding of the influences that affect wine quality.
The challenge I experienced with this analysis involved implementing the
mathematical algorithm correctly. I initially wanted to create three different
algorithms, a Support Vector Machine (SVM), Decision Tree Regression and
K-Nearest Neighbors (KNN) model and compare the results to provide more accurate
predictions for a wine’s quality as a function of the independent
physicochemical properties. I’m not as familiar with R as I am with Python, and
learning the syntax for the machine learning algorithms took too much time, so
I decided to simplify matters and fit a multiple linear regression model. The
linear regression model did very well and I’m proud of how well it performed;
the next step in the project would be to implement the SVM, KNN and Decision
Tree Regression algorithms for a more robust machine learning analysis.
It was exciting to investigate the independent variables for relationships
affecting wine quality. I learned how to make beautiful multivariate graphs and
I created my first multiple linear regression model using R. Overall, I’m
extremely proud of the work I have accomplished with this project.
1. Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, José Reis,
Modeling wine preferences by data mining from physicochemical properties,
In Decision Support Systems, Volume 47, Issue 4, 2009, Pages 547-553, ISSN
0167-9236, https://doi.org/10.1016/j.dss.2009.05.016.
(http://www.sciencedirect.com/science/article/pii/S0167923609001377)
2. Dataset link: http://www3.dsi.uminho.pt/pcortez/dss09.bib
3. http://r4stats.com/examples/graphics-ggplot2/
4. http://datadrivenjournalism.net/resources/when_should_i_use_logarithmic_scales_in_my_charts_and_graphs
5. https://www.r-bloggers.com/multiple-regression-lines-in-ggpairs/
6. https://stat.ethz.ch/R-manual/R-devel/library/base/html/levels.html
7. https://www.r-bloggers.com/from-continuous-to-categorical/
8. http://www.shonscience.com/unit-1-earth-as-a-system2/does-the-shape-size-or-temperature-of-matter-affect-its-density
9. https://machinelearningmastery.com/pre-process-your-dataset-in-r/
10. http://www.stat.columbia.edu/~martin/W2024/R6.pdf
11. http://data.library.virginia.edu/diagnostic-plots/
12. https://www.stat.berkeley.edu/classes/s133/Lr.html
13. http://www.public.iastate.edu/~maitra/stat501/lectures/Outliers.pdf